The purpose of this analysis is to explore a dataset describing the chemical characteristics and quality ratings of red wines.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
str(RedWine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The red wine data set contains nearly 1600 observations of 13 variables.
We can see the distribution of quality ratings has a minimum of 3 and a maximum of 8, with most ratings at 5 or 6. Surprisingly, there are no ratings of 1, 2, 9, or 10. I would have expected a larger range of quality ratings with such a large data set.
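A quick way to check which ratings are present is to tabulate them against the full 1 to 10 scale. The sketch below uses a short, made-up `quality` vector (an assumption for illustration, not the wine data) so that it runs on its own:

```r
# Hypothetical ratings, for illustration only (not taken from the data set).
quality <- c(5L, 5L, 6L, 5L, 7L, 6L, 3L, 6L, 5L, 8L)

# Fix the levels to 1..10 so ratings that never occur show up as zero counts.
counts <- table(factor(quality, levels = 1:10))
print(counts)

# Share of ratings that are 5 or 6.
share_5_6 <- mean(quality %in% c(5, 6))
print(share_5_6)
```

With the real data, `RedWine$quality` would take the place of the toy vector.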
I divided the data by quality level. Low: Quality of 0-3, Medium: Quality of 4-6, High: Quality of 7-10. We can see that the vast majority of observations fall in the medium quality level.
We can see that the Density and pH plots are the most normally distributed. The majority of pH levels fall between 3.0 and 3.5. Many of the plots are skewed to the right, including Free Sulfur Dioxide and Total Sulfur Dioxide. The majority of wines have less than 100 in total sulfur dioxide. Several of the plots are long tailed, such as Residual Sugar and Chlorides.
The above plots compare the variables before and after transformation. The data for residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide become more normally distributed after applying log10.
There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol, and quality). All of the variables are numeric with the exception of quality, which is in integer form.
Most of the wines have a quality of 5 or 6.
The 3rd quartile of residual sugar levels is 2.6, although there are a few major outliers, with the maximum residual sugar level of 15.5. I’m interested to see if higher residual sugar wines tend to have lower or higher quality.
Most wines have an alcohol content of less than 12%. This surprises me, given that the majority of red wines I’m familiar with have alcohol contents above 13.5%.
Many of the wines have 0 citric acid.
The main feature of interest in the dataset is quality, and I’d like to determine which variables impact quality ratings the most. I suspect alcohol, residual sugar, and pH contribute to quality ratings, as they seem to be features you may be able to detect during wine tastings.
From research into what contributes to the taste of wine, I discovered that sweetness, acidity, tannin, alcohol, and body are the main features. In addition to pH, I think fixed acidity and volatile acidity may contribute to the acidity of wine.
Yes, I created a new variable called quality level, which cuts the quality levels into low (3, 4), medium (5, 6), and high (7,8).
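One way to build such a grouping is with `cut()`. This is only a sketch: the column name `quality.level`, the break points, and the small stand-in data frame are assumptions chosen to match the low (3, 4), medium (5, 6), high (7, 8) grouping described here.

```r
# Hypothetical stand-in for the RedWine data frame; only `quality` is needed.
wine <- data.frame(quality = c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L))

# Assumed breaks: (2, 4] -> low, (4, 6] -> medium, (6, 8] -> high.
# cut() uses right-closed intervals by default.
wine$quality.level <- cut(
  wine$quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high"),
  ordered_result = TRUE
)

# Count wines in each level.
level_counts <- table(wine$quality.level)
print(level_counts)
```

Making the factor ordered keeps the levels in low, medium, high order in plots and summaries.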
I deleted column X because it was simply a repeat of the index.
I applied log10 to residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide in order to normalize the distributions.
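The two cleaning steps above (dropping the index column and log-transforming skewed variables) can be sketched as follows. The miniature data frame and the `log10.` column prefix are assumptions for illustration; the original analysis may have applied the transformation only at plotting time.

```r
# Hypothetical miniature of the raw data; values are illustrative only.
wine <- data.frame(
  X = 1:5,
  residual.sugar = c(1.9, 2.6, 2.3, 6.1, 15.5),
  chlorides = c(0.076, 0.098, 0.092, 0.071, 0.611)
)

# Drop the redundant index column.
wine$X <- NULL

# Add log10-transformed copies so the original columns stay available.
skewed <- c("residual.sugar", "chlorides")
for (v in skewed) {
  wine[[paste0("log10.", v)]] <- log10(wine[[v]])
}

str(wine)
```

Note that log10 is only defined for positive values, which holds for these four variables according to the summary (all minimums are above zero).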
I analyzed the following bivariate relationships:
Quality vs. Alcohol
Quality vs. pH
Quality vs. Residual Sugar
Quality vs. Fixed Acidity
Quality vs. Volatile Acidity
Residual Sugar vs. Alcohol
Residual Sugar vs. pH
pH vs. Alcohol
Fixed Acidity vs. Density
Fixed Acidity vs. pH
pH vs. Citric Acid
Quality Level vs. Alcohol
Quality Level vs. pH
Quality Level vs. Residual Sugar
The correlogram indicates that most pairs of variables are not highly correlated. The strongest relationships appear to be density vs. fixed acidity (r = 0.67), citric acid vs. fixed acidity (r = 0.67), pH vs. fixed acidity (r = -0.68), and total sulfur dioxide vs. free sulfur dioxide (r = 0.67). A correlation between citric acid and fixed acidity is not surprising, as they are both acids. Free sulfur dioxide is part of total sulfur dioxide, so a correlation is expected. pH is a measure of acidity, so the correlation between pH and fixed acidity is not surprising either. I am unsure what would cause a correlation between density and fixed acidity, but it could be that more acidic liquid is denser than less acidic liquid.
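The coefficients shown in a correlogram are pairwise Pearson correlations, which can also be computed directly with `cor()`. The sketch below uses a small made-up data frame (an assumption, not the wine data), so its output will not match the values quoted above.

```r
# Hypothetical values; not taken from the wine data.
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 7.3, 7.5),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.39, 3.35),
  density       = c(0.9978, 0.9968, 0.9980, 0.9964, 0.9946, 0.9978)
)

# Pairwise Pearson correlations, rounded for reading.
r <- cor(toy, method = "pearson")
print(round(r, 2))

# A single pair, with a significance test attached.
ct <- cor.test(toy$fixed.acidity, toy$pH)
print(ct$estimate)
```

A Pearson coefficient ranges from -1 to 1, so its sign shows the direction of the relationship; squaring it gives the share of variance explained by a simple linear fit.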
This plot analyzes alcohol content across qualities. Red points represent mean alcohol content by quality and blue points represent outliers. The plot confirms a moderate correlation (r = 0.48). Alcohol content seems to increase rapidly between the medium and high quality wines.
This plot analyzes pH across qualities. Red points represent the mean pH by quality and blue points represent outliers. The quality vs. pH plot does not seem to indicate any trend or correlation, supporting the r value of 0.06.
This plot analyzes residual sugar on a log10 scale across qualities. Red points represent mean residual sugar by quality and blue points represent outliers. The boxplot of residual sugar across qualities does not indicate a correlation between residual sugar and quality, confirming the r value of 0.01.
This plot visualizes fixed acidity across qualities. Red points represent the mean fixed acidity by quality and blue points represent outliers. The boxplot of fixed acidity across qualities shows a very weak positive correlation, confirming the low r value of 0.12.
The quality vs. volatile acidity plot shows a moderate negative correlation. This is confirmed by the r value of -0.39.
The residual sugar by alcohol plot does not show a clear trend, which is confirmed by the low r value of 0.04. This is surprising given that lower alcohol wines tend to taste sweeter, leading one to believe that they contain more sugar.
This plot visualizes residual sugar on a log10 scale by pH. There doesn’t appear to be a notable correlation between pH and residual sugar, which is supported by the r value of -0.09.
The pH by alcohol content plot indicates a slight positive correlation, which is confirmed by the r value of 0.21.
The density by fixed acidity plot indicates a strong positive correlation. The smoother highlights this trend. The r value of 0.67 is one of the strongest correlations of any pair of variables in the data set.
The pH by fixed acidity plot shows a strong negative correlation. The smoother and the r value of -0.68 confirm this trend. This trend is not surprising given that pH is a measure of acidity and a lower pH indicates higher acidity.
This boxplot visualizes alcohol content by quality level. Outliers are plotted in red. It is interesting that the alcohol content tends to be much higher in the higher quality wines than the medium or low quality wines.
This boxplot visualizes the pH level by quality level. Outliers are plotted in red. The median pH level decreases as the quality increases. The range of the data is lower at the highest quality level.
This boxplot represents residual sugar by quality level. Outliers are plotted in red. The range of outliers is large in this plot, especially in the medium quality level. All of the outliers in all quality levels are high outliers; they have very high levels of residual sugar rather than very low levels of residual sugar.
This boxplot visualizes volatile acidity by quality level. Outliers are plotted in red. There is a clear relationship between quality level and volatile acidity. Both the IQR and median volatile acidity decrease as quality level increases.
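The median and IQR that a boxplot displays can also be tabulated per group, which makes a trend like this one easy to check numerically. The values and the `quality.level` column below are hypothetical, chosen only so the sketch runs on its own.

```r
# Hypothetical observations; levels follow the low/medium/high grouping.
toy <- data.frame(
  quality.level = factor(
    c("low", "low", "low", "medium", "medium", "medium", "high", "high", "high"),
    levels = c("low", "medium", "high")
  ),
  volatile.acidity = c(0.58, 0.88, 0.70, 0.50, 0.66, 0.60, 0.28, 0.40, 0.36)
)

# Median and interquartile range of volatile acidity within each level.
med <- tapply(toy$volatile.acidity, toy$quality.level, median)
iqr <- tapply(toy$volatile.acidity, toy$quality.level, IQR)
print(rbind(median = med, IQR = iqr))
```

In this toy data both statistics fall from low to high, mirroring the pattern described for the real boxplot.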
The main feature of interest in this analysis is quality, and whether any features show an effect on quality. The correlogram shows that the feature most strongly correlated with quality is alcohol (r = 0.48). The alcohol content by quality boxplot highlights this trend. The alcohol content by quality level boxplot shows this relationship even better, with a clear increase in median alcohol levels in the highest quality wines.
The relationship between quality and pH has an r value of 0.06, which indicates practically zero correlation, and the corresponding boxplot confirms this. However, when pH is compared to quality levels, there is a pattern in the boxplot. The median pH levels seem to decrease as the quality level increases, especially between the lower quality and medium quality wines.
The correlogram indicates that there is no correlation (r = 0.01) between residual sugar and quality. Even after transforming the residual sugar data using log10 and plotting it against quality levels, there seems to be no clear relationship between residual sugar and quality.
One of the most striking bivariate relationships discovered was the relationship between quality level and volatile acidity. The correlogram shows an r value of -0.39 between quality and volatile acidity. The volatile acidity by quality boxplot visualizes how volatile acidity decreases as quality increases. This trend is highlighted further when quality is grouped into levels.
Some of the most interesting relationships were between variables that were not the main feature of interest. In fact, three of the four strongest correlations involved fixed acidity. Fixed acidity had the strongest correlations with density, citric acid, and pH. As discussed previously, this is not surprising given that many of the variables are either acids themselves or a measure of acidity.
The strongest relationship, according to the correlation coefficient, is between pH and fixed acidity (r = -0.68). However, once the data was cut into quality levels, the plots indicate that there are strong relationships between quality level and alcohol, quality level and volatile acidity, and quality level and pH.
Because the majority of the data has a medium quality level, the data is highly clustered. The smoother shows a slightly higher fixed acidity vs. density ratio for higher quality level wines vs. medium or lower quality level wines.
This plot doesn’t show strong trends, but it does show how the majority of the data falls in the lower alcohol, lower pH quadrant of the chart.
This plot shows some differences in the relationship between pH and citric acid by quality level. The quality level of wine appears to correlate with the level of citric acid for any given pH.
The relationship between pH and fixed acidity seems to be uniform across the high and medium quality wines. The low quality wines follow a less linear pattern, but this may be attributed to the limited low quality data points.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = RedWine)
## m2: lm(formula = quality ~ alcohol + pH, data = RedWine)
## m3: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)),
## data = RedWine)
## m4: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar, data = RedWine)
## m5: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity, data = RedWine)
## m6: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity, data = RedWine)
## m7: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)),
## data = RedWine)
## m8: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide, data = RedWine)
## m9: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid, data = RedWine)
## m10: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid + I(log10(chlorides)),
## data = RedWine)
## m11: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid + I(log10(chlorides)) +
## I(log10(free.sulfur.dioxide)), data = RedWine)
## m12: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) +
## residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) +
## total.sulfur.dioxide + citric.acid + I(log10(chlorides)) +
## I(log10(free.sulfur.dioxide)) + sulphates, data = RedWine)
##
## ====================================================================================================================================================================
##                                 m1         m2         m3         m4         m5         m6         m7         m8         m9         m10        m11        m12
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                     1.875***   4.426***   4.526***   4.539***   3.014***   3.559***   3.971***   3.834***   3.901***   3.871***   4.036***   3.359***
##                                 (0.175)    (0.387)    (0.393)    (0.393)    (0.601)    (0.576)    (0.607)    (0.605)    (0.606)    (0.605)    (0.610)    (0.606)
## alcohol                         0.361***   0.386***   0.389***   0.391***   0.386***   0.330***   0.321***   0.325***   0.330***   0.321***   0.315***   0.292***
##                                 (0.017)    (0.017)    (0.017)    (0.017)    (0.017)    (0.017)    (0.017)    (0.017)    (0.017)    (0.018)    (0.018)    (0.018)
## pH                                         -0.850***  -0.870***  -0.872***  -0.506**   -0.259     -0.288     -0.426**   -0.456**   -0.513**   -0.524**   -0.487**
##                                            (0.116)    (0.116)    (0.116)    (0.159)    (0.154)    (0.154)    (0.157)    (0.158)    (0.161)    (0.160)    (0.158)
## I(log10(residual.sugar))                              -0.171     -0.495     -0.728*    -0.199     -0.120     -0.142     -0.127     -0.084     -0.031     0.057
##                                                       (0.114)    (0.313)    (0.320)    (0.309)    (0.311)    (0.309)    (0.309)    (0.310)    (0.310)    (0.305)
## residual.sugar                                                   0.038      0.059      0.012      0.009      0.020      0.020      0.018      0.011      0.009
##                                                                  (0.034)    (0.035)    (0.033)    (0.033)    (0.033)    (0.033)    (0.033)    (0.033)    (0.033)
## fixed.acidity                                                               0.047***   0.023      0.018      0.009      0.022      0.018      0.018      0.017
##                                                                             (0.014)    (0.014)    (0.014)    (0.014)    (0.016)    (0.016)    (0.016)    (0.016)
## volatile.acidity                                                                       -1.249***  -1.255***  -1.233***  -1.332***  -1.282***  -1.253***  -1.114***
##                                                                                        (0.101)    (0.101)    (0.100)    (0.117)    (0.120)    (0.120)    (0.120)
## I(log10(total.sulfur.dioxide))                                                                    -0.124*    0.424**    0.405**    0.429**    0.210      0.103
##                                                                                                   (0.058)    (0.144)    (0.145)    (0.145)    (0.178)    (0.175)
## total.sulfur.dioxide                                                                                         -0.006***  -0.005***  -0.006***  -0.005***  -0.004**
##                                                                                                              (0.001)    (0.001)    (0.001)    (0.001)    (0.001)
## citric.acid                                                                                                             -0.234     -0.170     -0.134     -0.226
##                                                                                                                         (0.142)    (0.146)    (0.147)    (0.145)
## I(log10(chlorides))                                                                                                                -0.248     -0.244     -0.552***
##                                                                                                                                    (0.132)    (0.131)    (0.135)
## I(log10(free.sulfur.dioxide))                                                                                                                 0.203*     0.198*
##                                                                                                                                               (0.095)    (0.094)
## sulphates                                                                                                                                                0.813***
##                                                                                                                                                          (0.108)
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                       0.227      0.252      0.253      0.254      0.259      0.324      0.326      0.333      0.334      0.336      0.338      0.361
## adj. R-squared                  0.226      0.251      0.252      0.252      0.257      0.322      0.323      0.330      0.331      0.332      0.333      0.356
## sigma                           0.710      0.699      0.699      0.699      0.696      0.665      0.664      0.661      0.661      0.660      0.659      0.648
## F                               468.267    268.888    180.161    135.446    111.286    127.233    109.959    99.324     88.685     80.298     73.574     74.568
## p                               0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000
## Log-likelihood                  -1721.057  -1694.466  -1693.325  -1692.710  -1687.117  -1613.455  -1611.152  -1602.601  -1601.237  -1599.456  -1597.172  -1568.959
## Deviance                        805.870    779.508    778.397    777.799    772.376    704.393    702.367    694.895    693.711    692.167    690.192    666.261
## AIC                             3448.114   3396.931   3396.650   3397.421   3388.234   3242.909   3240.303   3225.203   3224.475   3222.913   3220.344   3165.918
## BIC                             3464.245   3418.440   3423.536   3429.684   3425.874   3285.926   3288.697   3278.974   3283.623   3287.438   3290.247   3241.198
## N                               1599       1599       1599       1599       1599       1599       1599       1599       1599       1599       1599       1599
## ====================================================================================================================================================================
When looking at pH vs. citric acid in terms of quality levels, there does seem to be a relationship. For any given pH, the citric acid level appears to increase as the quality level increases. This relationship becomes less clear as pH rises above 3.5, possibly because there are fewer data points in that range.
In the density vs. fixed acidity in terms of quality level plot, there is a very strong relationship. For a given fixed acidity level below 12, the average density level is higher for lower quality wines than higher quality wines.
In the density vs. fixed acidity in terms of quality level plot, the smoother for the lowest quality wines does not appear linear. It almost appears logarithmic, with density rising more slowly as fixed acidity increases.
I created a model to predict quality with several variables, including alcohol, pH, residual sugar, fixed acidity, volatile acidity, total sulfur dioxide, citric acid, chlorides, free sulfur dioxide, and sulphates. The r^2 value of the model is 0.361. Because quality ratings are chosen by humans and are not scientific, an r^2 value of 0.361 is relatively strong.
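The incremental approach of adding one predictor at a time and watching R-squared can be sketched as below. The data are synthetic, generated with made-up coefficients purely to exercise the workflow, so the numbers bear no relation to the wine results; only base R's `lm()`, `update()`, `summary()`, and `anova()` are used.

```r
set.seed(1)  # reproducible synthetic data; not the wine data
n <- 200
toy <- data.frame(
  alcohol = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
# Made-up linear signal plus noise, used only for illustration.
toy$quality <- 3 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.7)

# Start with one predictor, then extend the same model with a second.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Compare plain and adjusted R-squared across the nested models.
r2  <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
print(round(rbind(r2, adj), 3))

# F-test for whether the added term improves the fit.
print(anova(m1, m2))
```

Plain R-squared never decreases when a term is added to a nested model, so adjusted R-squared, AIC, or BIC are usually better guides to whether an extra predictor is worth keeping.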
The number of observations with quality ratings of 3, 4, 7, and 8 is much lower than the number with ratings of 5 and 6. A bigger overall data set, with more observations at the lower and higher quality ratings, would improve the model.
This plot visualizes alcohol content by quality level. Red points represent outliers and blue points represent the mean alcohol content by quality level. We can see a similar mean alcohol content for both low and medium quality wines. Interestingly, the mean alcohol content spikes much higher for the highest quality wines.
This plot visualizes volatile acidity by quality level. Red points represent outliers and blue points represent the mean volatile acidity by quality level. We can see a very strong negative relationship between quality level and volatile acidity. As the quality level increases, both the mean volatile acidity and the IQR decrease.
This plot is notable because it visualizes the relationship between the two variables with the strongest correlation in the data set. Fixed acidity and pH have an r value of -0.68. This relationship makes a lot of sense because pH is a measure of acidity; a lower pH indicates a substance is more acidic and a higher pH indicates a substance is more basic. This plot confirms this relationship.
The goal of this exploratory data analysis was to determine which variables most impacted quality. There were several insights I found during the exploration of this data set.
Alcohol and Quality: There is a clear relationship between quality level and alcohol content, but only for the highest quality wines.
pH and Quality: There is a negative relationship between pH and quality levels. This is unclear until quality is separated into quality levels.
Volatile Acidity and Quality: There is a clear negative relationship between volatile acidity and quality.
The correlogram was very helpful in showing correlations between variables other than quality. In several cases, the relationship between quality and a specific variable was unclear until quality was separated into quality levels.
It would be useful to know the type of red wine, such as Cabernet Sauvignon, Pinot Noir, Merlot, etc. It is very difficult to analyze trends when the type of the wine is unknown. For example, a certain wine type may be expected to have more alcohol, and someone rating that wine might therefore rate it more favorably than someone rating a wine that was expected to have a lower alcohol content. Furthermore, it would be interesting to analyze wines from different parts of the world to see if there is a relationship between region and quality or any of the other variables.